Four \(x\)-\(y\) datasets which have the same traditional statistical properties (mean, variance, correlation, regression line, etc.), yet are quite different.
anscombe
A data frame with 11 observations on the following 8 variables.
x1 == x2 == x3 the integers 4:14, specially arranged
x4 values 8 and 19
y1, y2, y3, y4 numbers in (3, 12.5) with mean 7.5 and sdev 2.03
Tufte, Edward R. (1989). The Visual Display of Quantitative Information, 13–14. Graphics Press.
Anscombe, Francis J. (1973). Graphs in statistical analysis. The American Statistician, 27, 17–21. doi:10.2307/2682899.
require(stats);
require(graphics)
require(knitr)
## Loading required package: knitr
summary(anscombe)
## x1 x2 x3 x4 y1
## Min. : 4.0 Min. : 4.0 Min. : 4.0 Min. : 8 Min. : 4.260
## 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 8 1st Qu.: 6.315
## Median : 9.0 Median : 9.0 Median : 9.0 Median : 8 Median : 7.580
## Mean : 9.0 Mean : 9.0 Mean : 9.0 Mean : 9 Mean : 7.501
## 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.: 8 3rd Qu.: 8.570
## Max. :14.0 Max. :14.0 Max. :14.0 Max. :19 Max. :10.840
## y2 y3 y4
## Min. :3.100 Min. : 5.39 Min. : 5.250
## 1st Qu.:6.695 1st Qu.: 6.25 1st Qu.: 6.170
## Median :8.140 Median : 7.11 Median : 7.040
## Mean :7.501 Mean : 7.50 Mean : 7.501
## 3rd Qu.:8.950 3rd Qu.: 7.98 3rd Qu.: 8.190
## Max. :9.260 Max. :12.74 Max. :12.500
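`summary()` reports only means and quartiles, so the claim that the four sets also share variance and correlation is worth checking directly. The following is a minimal sketch using only base R and the built-in `anscombe` data; the names `xs`, `ys`, and `stats_tbl` are illustrative and not part of the original example.

```r
# Column names for the four x/y pairs
xs <- paste0("x", 1:4)
ys <- paste0("y", 1:4)

# One row per statistic, one column per dataset
stats_tbl <- rbind(
  mean_x = sapply(xs, function(v) mean(anscombe[[v]])),
  var_x  = sapply(xs, function(v) var(anscombe[[v]])),
  mean_y = sapply(ys, function(v) mean(anscombe[[v]])),
  var_y  = sapply(ys, function(v) var(anscombe[[v]])),
  cor_xy = mapply(function(x, y) cor(anscombe[[x]], anscombe[[y]]), xs, ys)
)
colnames(stats_tbl) <- paste0("set", 1:4)
print(round(stats_tbl, 2))
```

Each row should be essentially constant across the four columns, which is the point of the dataset.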
Now some “magic” to run the four regressions in a loop:
ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  ## or   ff[[2]] <- as.name(paste0("y", i))
  ##      ff[[3]] <- as.name(paste0("x", i))
  mods[[i]] <- lmi <- lm(ff, data = anscombe)
  print(kable(anova(lmi)))
  cat('\n')
}
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| x1 | 1 | 27.51000 | 27.510001 | 17.98994 | 0.0021696 |
| Residuals | 9 | 13.76269 | 1.529188 | NA | NA |

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| x2 | 1 | 27.50000 | 27.500000 | 17.96565 | 0.0021788 |
| Residuals | 9 | 13.77629 | 1.530699 | NA | NA |

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| x3 | 1 | 27.47001 | 27.470008 | 17.97228 | 0.0021763 |
| Residuals | 9 | 13.75619 | 1.528466 | NA | NA |

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| x4 | 1 | 27.49000 | 27.490001 | 18.00329 | 0.0021646 |
| Residuals | 9 | 13.74249 | 1.526943 | NA | NA |
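The loop works because a two-sided formula in R is a language object of length 3, whose second and third elements can be overwritten like list entries. The sketch below takes that apart for the first dataset only; `f_alt` is an illustrative name, and `reformulate()` from the stats package is shown as an alternative rather than as what the original example uses.

```r
ff <- y ~ x
# A two-sided formula is a call of length 3: the `~` operator,
# the left-hand side, and the right-hand side.
print(as.list(ff))

# Overwrite elements 2 and 3 with the symbols y1 and x1
ff[2:3] <- lapply(c("y1", "x1"), as.name)
print(ff)

# Alternative: build the same formula from strings with stats::reformulate()
f_alt <- reformulate("x1", response = "y1")
print(f_alt)
```

Both routes give `y1 ~ x1`. The in-place replacement reuses one formula object across iterations, while `reformulate()` creates a fresh one each time and may be easier to read.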
See how close they are (numerically!)
sapply(mods, coef)
## lm1 lm2 lm3 lm4
## (Intercept) 3.0000909 3.000909 3.0024545 3.0017273
## x1 0.5000909 0.500000 0.4997273 0.4999091
lapply(mods, function(fm) coef(summary(fm)))
## $lm1
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0000909 1.1247468 2.667348 0.025734051
## x1 0.5000909 0.1179055 4.241455 0.002169629
##
## $lm2
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.000909 1.1253024 2.666758 0.025758941
## x2 0.500000 0.1179637 4.238590 0.002178816
##
## $lm3
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0024545 1.1244812 2.670080 0.025619109
## x3 0.4997273 0.1178777 4.239372 0.002176305
##
## $lm4
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0017273 1.1239211 2.670763 0.025590425
## x4 0.4999091 0.1178189 4.243028 0.002164602
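The goodness of fit is nearly identical as well. The sketch below refits the four models so that it runs on its own, then extracts R² from each summary; `fits` and `r2` are illustrative names, not objects from the original example.

```r
# Refit the four regressions y_i ~ x_i on the built-in anscombe data
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})
names(fits) <- paste0("lm", 1:4)

# Coefficient of determination for each fit
r2 <- sapply(fits, function(fm) summary(fm)$r.squared)
print(round(r2, 3))
```

All four values should come out close to 0.67, so R² alone cannot distinguish the datasets either.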
Now, do what you should have done in the first place: PLOTS
op <- par(mfrow = c(2, 2), mar = 0.1+c(4,4,1,1), oma = c(0, 0, 2, 0))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
       xlim = c(3, 19), ylim = c(3, 13))
  abline(mods[[i]], col = "blue")
}
mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)
par(op)
Just for fun, place the “Monstrous Costs” figure from Healy here (either find an image online, or print screen and crop). Make sure to align the figure to center, add a caption, and have the width set to 40%.
Place here the interactive animated plot from the introduction (you’d need to install and call the libraries gapminder, ggplot2 and plotly for it to work). Use the echo = FALSE option to hide the code.
Without hovering over the markers to show the data that is associated with them, identify a marker that captures your attention (from one of the years). Using the notion of “preattentive search”, try to understand and explain in writing why this particular marker caught your attention. Identify the country that is associated with this marker. Were you surprised? Have you learned something that you didn’t know? Affirmed an intuition? Repeat this exercise, this time with a marker that captures your attention from the animated sequence.
Static Analysis: The marker that captures my attention is the one with the highest GDP per capita in 1952. It sits at the middle of the right edge of the plot and is the only marker in that region. That isolation in spatial position is a preattentive cue: the marker pops out without any deliberate search. The country associated with this marker is Kuwait. I was surprised, because I did not know that Kuwait had such a high GDP per capita in 1952. The extremely high value can be attributed to the discovery and exploitation of its vast oil reserves. Kuwait has one of the largest oil reserves in the world, and the oil industry began to significantly impact its economy in the late 1940s and early 1950s.
Animated Sequence Analysis: During the animation, China's marker catches my attention because it shows a dramatic improvement in both GDP per capita and life expectancy over time. This suggests that the country has experienced rapid development and rising living standards. Such a transformation highlights the impact of economic development and policy choices on health and well-being.
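The static observation can be checked against the data after the visual guess has been made. This is a minimal sketch that assumes the gapminder package is installed (the code is skipped otherwise); `gm1952` and `top1952` are illustrative names.

```r
if (requireNamespace("gapminder", quietly = TRUE)) {
  gm <- as.data.frame(gapminder::gapminder)
  gm1952 <- gm[gm$year == 1952, ]

  # Row with the largest GDP per capita in 1952
  top1952 <- gm1952[which.max(gm1952$gdpPercap),
                    c("country", "year", "gdpPercap")]
  print(top1952)

  # How far the top value sits from the typical country in that year
  print(top1952$gdpPercap / median(gm1952$gdpPercap))
}
```

The ratio to the median gives a rough sense of why the marker is so isolated on the horizontal axis.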
Choose four of the seven “gestalt rules” enumerated on page 22 and, for each, provide an example of the principle in practice in the gapminder plot.
Starting to think: one thing we mentioned is that the gapminder plot does not highlight “inequality” very well. Visualizing inequalities entails visualizing distributions. Suggest a tentative method for highlighting aspects of inequality in these data that you find important. You may look online, refer to the diamonds app from the introduction or use any other source. You are not asked to provide any plots here, this is a teaser thought experiment (you may provide examples for visualizations you find relevant). We shall discuss visualization tools for comparing distributions at length throughout the class.
When considering how to visualize inequality in the context of the Gapminder data, we need to think about ways to represent the distribution of these metrics, rather than just an average or a single data point per country. Several methods could be used to highlight aspects of inequality:
Box-and-Whisker Plots: These plots show the median, quartiles, and extremes of the data, which can highlight disparities within and between income distributions. Because gapminder records a single GDP per capita value per country and year, a box for each continent (or each year) would show how widely countries are spread around the median; a box for each country would require additional within-country income data.
Histograms and Density Plots: These show the distribution of a single metric, such as GDP per capita. With within-country income data they could compare population segments inside a country; with the gapminder data alone they can compare how GDP per capita is distributed across countries.
Violin Plots: These plots combine the features of box plots and density plots, showing the full shape of a metric's distribution for each group being compared, for example each continent.
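As one tentative illustration of the box-plot idea, the sketch below uses base graphics and assumes the gapminder package is installed (the code is skipped otherwise). The choice of the year 2007, the log scale, and the 90/10 percentile ratio are assumptions made for this example, not part of the assignment.

```r
if (requireNamespace("gapminder", quietly = TRUE)) {
  gm <- as.data.frame(gapminder::gapminder)
  gm2007 <- gm[gm$year == 2007, ]

  # One box per continent; each observation is a country, so the spread
  # reflects between-country (not within-country) inequality.
  bp <- boxplot(gdpPercap ~ continent, data = gm2007, log = "y",
                xlab = "Continent", ylab = "GDP per capita (log scale)",
                main = "Between-country spread of GDP per capita, 2007")

  # Ratio of the 90th to the 10th percentile as a simple dispersion summary
  ratio_90_10 <- tapply(gm2007$gdpPercap, gm2007$continent,
                        function(v) unname(quantile(v, 0.9) / quantile(v, 0.1)))
  print(round(ratio_90_10, 1))
}
```

A larger ratio indicates a wider gap between richer and poorer countries within a continent. Population weighting is ignored here, which is a real limitation of this simple view.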